Abstract



In this note, we present a new averaging technique for the projected stochastic
subgradient method. By using a weighted average with a weight of $t + 1$ for each
iterate $w_t$ at iteration $t$, we obtain the convergence rate of $O(1/t)$ with both an easy
proof and an easy implementation. The new scheme is compared empirically to existing
techniques, with similar performance behavior.

1 Introduction 

We consider a strongly convex function $f$ defined on a convex set $K$. We denote by $\mu$ its
strong convexity constant. Following [1, 2, 3, 4], we consider a stochastic approximation
scenario where only unbiased estimates of subgradients of $f$ are available, with the projected
stochastic subgradient method.

More precisely, we assume that we have an increasing sequence of $\sigma$-fields $(\mathcal{F}_t)_{t \geq 0}$ such that
$w_0 \in K$ is $\mathcal{F}_0$-measurable and such that for all $t \geq 1$,

$$ w_t = \Pi_K(w_{t-1} - \gamma_t g_t), \qquad (1) $$

where

(a) $\Pi_K$ is the orthogonal projection on $K$,

(b) $\mathbb{E}(g_t \mid \mathcal{F}_{t-1})$ is almost surely a subgradient of $f$ at $w_{t-1}$ (which we denote $f'(w_{t-1})$),

(c) $\mathbb{E}(\|g_t\|^2) \leq B^2$ (finite variance condition).

We denote by $w^*$ the unique minimizer of $f$ on $K$.
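As an illustration of update rule (1), the following is a minimal sketch rather than the implementation used in this note. The choice of $K$ as a Euclidean ball, the toy quadratic objective, the Gaussian noise model, and all function names are assumptions made for the example.

```python
import math
import random


def project_ball(w, radius):
    """Orthogonal projection of w onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def projected_subgradient(w0, oracle, step_size, project, num_steps):
    """Iterate w_t = project(w_{t-1} - gamma_t * g_t) for t = 1, ..., num_steps.

    Returns the list [w_0, w_1, ..., w_T] of all iterates.
    """
    iterates = [list(w0)]
    for t in range(1, num_steps + 1):
        w_prev = iterates[-1]
        g = oracle(w_prev)       # stochastic subgradient at w_{t-1}
        gamma = step_size(t)     # step size gamma_t
        step = [wi - gamma * gi for wi, gi in zip(w_prev, g)]
        iterates.append(project(step))
    return iterates


# Toy problem (an assumption for illustration): f(w) = (mu / 2) * ||w - c||^2,
# observed through gradients corrupted by zero-mean Gaussian noise.
mu = 1.0
c = [0.5, -0.25]
rng = random.Random(0)


def noisy_gradient(w):
    return [mu * (wi - ci) + rng.gauss(0.0, 1.0) for wi, ci in zip(w, c)]


iterates = projected_subgradient(
    w0=[0.6, 0.6],                            # w_0 must lie in K
    oracle=noisy_gradient,
    step_size=lambda t: 1.0 / (mu * t),
    project=lambda w: project_ball(w, 1.0),   # K = unit ball
    num_steps=500,
)
```

Every iterate stays in $K$ by construction, which is the only property of the projection that the analysis relies on besides its non-expansiveness.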
2 Motivating example



Our main motivating example is the support vector machine (SVM) and its structured
prediction extensions [5, 6, 7], where the pairs $(x_t, y_t)$ for $t \geq 1$ are independent and identically
distributed and $f(w) = \mathbb{E}\,\ell(y, w^\top x) + \frac{\mu}{2}\|w\|^2$, where $\ell(y, u)$ is a Lipschitz-continuous convex
loss function (with respect to the second variable) and $K$ is the whole space (unconstrained
setup). We then have $g_t = \ell'(y_t, w_{t-1}^\top x_t)\,x_t + \mu w_{t-1}$, where $\ell'(y, u)$ denotes any subgradient
with respect to the second variable.

If we make the additional assumption that $\mathbb{E}\|x\|^2$ is finite, then this setup satisfies the
assumptions above with $B^2 = 4L_\ell^2\,\mathbb{E}\|x\|^2$, where $L_\ell$ is the Lipschitz constant for $\ell$. We show
this bound in Appendix A.

Alternatively, we can consider $K$ to be a compact convex subset. This is used in particular
in a projected version of the stochastic subgradient method for SVM in [1]. In this case, we
can take $B^2 = \big(L_\ell\sqrt{\mathbb{E}\|x\|^2} + \mu \max_{w \in K}\|w\|\big)^2$.
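To make the form of $g_t$ concrete, here is a small sketch for the hinge loss $\ell(y, u) = \max\{0, 1 - yu\}$, which is the loss used in the experiments of Section 4 and is Lipschitz-continuous with constant 1. The function names are hypothetical, and the subgradient chosen at the kink $yu = 1$ is one valid choice among several.

```python
def hinge_subgradient(y, u):
    """One subgradient of u -> max(0, 1 - y * u); at the kink y * u = 1 we pick 0."""
    return -y if y * u < 1.0 else 0.0


def svm_stochastic_subgradient(w, x, y, mu):
    """g_t = l'(y_t, w_{t-1}^T x_t) * x_t + mu * w_{t-1} for a single pair (x_t, y_t)."""
    u = sum(wi * xi for wi, xi in zip(w, x))   # margin score w^T x
    slope = hinge_subgradient(y, u)
    return [slope * xi + mu * wi for wi, xi in zip(w, x)]


# Example with the margin violated: the loss term contributes -y * x.
g_violated = svm_stochastic_subgradient([0.0, 0.0], [1.0, 2.0], 1, 0.1)

# Example with the margin satisfied: only the regularizer contributes mu * w.
g_satisfied = svm_stochastic_subgradient([2.0, 0.0], [1.0, 0.0], 1, 0.5)
```

In the first call the result is $-x$, and in the second it is $\mu w$, matching the two regimes of the hinge loss.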



3 Convergence analysis 

Following standard proof techniques [1, 2], we have: 

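$$
\begin{aligned}
\|w_t - w^*\|^2 &\leq \|w_{t-1} - \gamma_t g_t - w^*\|^2 \quad \text{(because orthogonal projections contract distances)} \\
&= \|w_{t-1} - w^*\|^2 + \gamma_t^2\|g_t\|^2 - 2\gamma_t (w_{t-1} - w^*)^\top g_t, \\
\mathbb{E}(\|w_t - w^*\|^2 \mid \mathcal{F}_{t-1}) &\leq \|w_{t-1} - w^*\|^2 + \gamma_t^2\,\mathbb{E}(\|g_t\|^2 \mid \mathcal{F}_{t-1}) - 2\gamma_t (w_{t-1} - w^*)^\top f'(w_{t-1}) \\
&\leq \|w_{t-1} - w^*\|^2 + \gamma_t^2\,\mathbb{E}(\|g_t\|^2 \mid \mathcal{F}_{t-1}) - 2\gamma_t\Big[f(w_{t-1}) - f(w^*) + \frac{\mu}{2}\|w_{t-1} - w^*\|^2\Big].
\end{aligned}
$$

The last inequality is obtained from the $\mu$-strong convexity of $f$. Thus, by re-arranging the
function values on the LHS and taking expectations on both sides, we get:

$$ 2\gamma_t\big[\mathbb{E} f(w_{t-1}) - f(w^*)\big] \leq \gamma_t^2 B^2 + (1 - \mu\gamma_t)\,\mathbb{E}\|w_{t-1} - w^*\|^2 - \mathbb{E}\|w_t - w^*\|^2, $$

$$ \mathbb{E} f(w_{t-1}) - f(w^*) \leq \frac{\gamma_t B^2}{2} + \frac{\gamma_t^{-1} - \mu}{2}\,\mathbb{E}\|w_{t-1} - w^*\|^2 - \frac{\gamma_t^{-1}}{2}\,\mathbb{E}\|w_t - w^*\|^2. \qquad (2) $$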



3.1 Classical analysis 

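With $\gamma_t = \frac{1}{\mu t}$, inequality (2) becomes

$$ \mathbb{E} f(w_{t-1}) - f(w^*) \leq \frac{B^2}{2\mu t} + \frac{\mu(t-1)}{2}\,\mathbb{E}\|w_{t-1} - w^*\|^2 - \frac{\mu t}{2}\,\mathbb{E}\|w_t - w^*\|^2, $$

and by summing from $t = 1$ to $t = T$, we obtain:

$$
\begin{aligned}
\mathbb{E} f\Big(\frac{1}{T}\sum_{t=1}^{T} w_{t-1}\Big) - f(w^*) &\leq \frac{1}{T}\sum_{t=1}^{T} \mathbb{E} f(w_{t-1}) - f(w^*) \\
&\leq \frac{B^2}{2\mu T}\sum_{t=1}^{T}\frac{1}{t} + \frac{1}{T}\Big[0 - \frac{\mu T}{2}\,\mathbb{E}\|w_T - w^*\|^2\Big] \\
&\leq \frac{B^2}{2\mu T}(1 + \log T).
\end{aligned}
$$

The first line used the convexity of $f$; the second line is obtained from a telescoping sum.
We also obtain $\mathbb{E}\|w_T - w^*\|^2 \leq \frac{B^2}{\mu^2 T}(1 + \log T)$.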



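3.2 New analysis

With $\gamma_t = \frac{2}{\mu(t+1)}$, and multiplying inequality (2) by $t$, we obtain:

$$
\begin{aligned}
t\,\big[\mathbb{E} f(w_{t-1}) - f(w^*)\big] &\leq \frac{t B^2}{\mu(t+1)} + \frac{\mu}{4}\Big[t(t-1)\,\mathbb{E}\|w_{t-1} - w^*\|^2 - t(t+1)\,\mathbb{E}\|w_t - w^*\|^2\Big] \\
&\leq \frac{B^2}{\mu} + \frac{\mu}{4}\Big[t(t-1)\,\mathbb{E}\|w_{t-1} - w^*\|^2 - t(t+1)\,\mathbb{E}\|w_t - w^*\|^2\Big].
\end{aligned}
$$

By summing from $t = 1$ to $t = T$ these $t$-weighted inequalities, we obtain a similar telescoping
sum, but this time the term with $B^2$ stays constant across the sum:

$$ \sum_{t=1}^{T} t\,\big[\mathbb{E} f(w_{t-1}) - f(w^*)\big] \leq \frac{T B^2}{\mu} + \frac{\mu}{4}\Big[0 - T(T+1)\,\mathbb{E}\|w_T - w^*\|^2\Big]. \qquad (3) $$

Thus

$$ \mathbb{E} f\Big(\frac{2}{T(T+1)}\sum_{t=1}^{T} t\,w_{t-1}\Big) - f(w^*) + \frac{\mu}{2}\,\mathbb{E}\|w_T - w^*\|^2 \leq \frac{2B^2}{\mu(T+1)}, $$

which implies

$$ \mathbb{E} f\Big(\frac{2}{(T+1)(T+2)}\sum_{t=0}^{T} (t+1)\,w_t\Big) - f(w^*) \leq \frac{2B^2}{\mu(T+2)} $$

and

$$ \mathbb{E}\|w_T - w^*\|^2 \leq \frac{4B^2}{\mu^2(T+1)}. $$

So by using the weighted average $\bar{w}_T = \frac{2}{(T+1)(T+2)}\sum_{t=0}^{T}(t+1)\,w_t$ instead of a uniform
average, we get an $O(1/T)$ rate instead of $O((\log T)/T)$. Note that these averaging schemes are
efficiently implemented in an online fashion as:

$$ \bar{w}_t = (1 - \rho_t)\,\bar{w}_{t-1} + \rho_t\,w_t. \qquad (4) $$

For the proposed weighted averaging scheme, $\rho_t = 2/(t+2)$ (compare with $\rho_t = 1/(t+1)$
for the uniform averaging scheme).

Figure 1: Comparison of optimization strategies for support vector machine objective. Top
from left to right: quantum, protein, and sido data sets. Bottom from left to right: rcv1,
covertype, and news data sets. This figure is best viewed in colour.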
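The online update (4) can be checked numerically against the explicit weighted average. The sketch below assumes the initialization $\bar{w}_0 = w_0$ and uses hypothetical function names; it is meant only as a sanity check of the recursion, not as the code used for the experiments.

```python
def running_average(iterates, rho):
    """Apply bar_w_t = (1 - rho_t) * bar_w_{t-1} + rho_t * w_t, starting from bar_w_0 = w_0."""
    avg = list(iterates[0])
    for t in range(1, len(iterates)):
        r = rho(t)
        avg = [(1.0 - r) * a + r * w for a, w in zip(avg, iterates[t])]
    return avg


def explicit_weighted_average(iterates):
    """Compute 2 / ((T + 1)(T + 2)) * sum_{t=0}^{T} (t + 1) * w_t directly."""
    T = len(iterates) - 1
    normalizer = (T + 1) * (T + 2) / 2.0
    dim = len(iterates[0])
    return [
        sum((t + 1) * iterates[t][j] for t in range(T + 1)) / normalizer
        for j in range(dim)
    ]


# Three one-dimensional iterates w_0, w_1, w_2 chosen arbitrarily for the check.
iterates = [[1.0], [2.0], [4.0]]

weighted_online = running_average(iterates, lambda t: 2.0 / (t + 2))
weighted_direct = explicit_weighted_average(iterates)
uniform_online = running_average(iterates, lambda t: 1.0 / (t + 1))
```

Here both weighted computations give $17/6$, while the uniform recursion gives the plain mean $7/3$. The online form needs only the previous average in memory, which is what makes the scheme cheap to implement.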



4 Experiments 

To test the empirical performance of the averaging scheme, we performed a series of
experiments using the support vector machine optimization problem

$$ \min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \max\{0,\, 1 - y_i w^\top x_i\}, $$

where $x_i$ is in a Euclidean space and $y_i \in \{-1, 1\}$.

We performed experiments on a set of freely available benchmark binary classification data
sets. The quantum (n = 50000, p = 78) and protein (n = 145751, p = 74) data sets were
obtained from the KDD Cup 2004 website,^1 the sido data set (n = 12678, p = 4932) was
obtained from the Causality Workbench website,^2 while the rcv1 (n = 20242, p = 47236),
covertype (n = 581012, p = 54), and news (n = 19996, p = 1355191) data sets were obtained
from the LIBSVM data website.^3 We added a (regularized) bias term to all data sets, and
for dense features we standardized so that they would have a mean of zero and a variance
of one.

^1 http://osmot.cs.cornell.edu/kddcup
^2 http://www.causality.inf.ethz.ch/home.php
^3 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

We set the regularization parameter $\lambda$ to $1/n$, although we found that the relative
performance of the methods was not particularly sensitive to this choice. We didn't use
any projection ($K$ is the whole space). Our experiments compared the following averaging
strategies:

- 0: No averaging. 

- 1: Averaging all iterates with uniform weight. 

- 0.5: Averaging the second half of the iterates with uniform weight, as proposed in [4]. 

- D: Averaging all iterates since the last iteration that was a power of 2 with uniform 
weight (the 'doubling trick'), also proposed in [4]. 

- W: Averaging all iterates with a weight of $t + 1$, as discussed in this note.

- W^2: Averaging all iterates with a weight of $(t + 1)^2$, which puts even further emphasis
on recent iterations.
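The strategies listed above can be sketched as batch computations over a stored list of iterates $w_0, \dots, w_T$. This is an illustrative reading of the descriptions, not the code used for the experiments: the function names are hypothetical, and the exact index conventions (where the "second half" starts, and taking the largest power of 2 not exceeding $T$ for the doubling trick) are assumptions.

```python
def weighted_mean(vectors, weights):
    """Weighted mean of a list of equal-length vectors."""
    total = float(sum(weights))
    dim = len(vectors[0])
    return [sum(a * v[j] for a, v in zip(weights, vectors)) / total for j in range(dim)]


def average_uniform(iterates):
    """Scheme 1: uniform average of all iterates."""
    return weighted_mean(iterates, [1] * len(iterates))


def average_suffix(iterates):
    """Scheme 0.5: uniform average of the second half of the iterates."""
    tail = iterates[len(iterates) // 2:]
    return weighted_mean(tail, [1] * len(tail))


def average_doubling(iterates):
    """Scheme D: uniform average since the last iteration index that was a power of 2."""
    T = len(iterates) - 1
    start = 1 << (T.bit_length() - 1) if T >= 1 else 0
    tail = iterates[start:]
    return weighted_mean(tail, [1] * len(tail))


def average_polynomial(iterates, power):
    """Schemes W (power=1) and W^2 (power=2): weight (t + 1)**power on iterate w_t."""
    weights = [(t + 1) ** power for t in range(len(iterates))]
    return weighted_mean(iterates, weights)


# Toy one-dimensional trajectory w_t = t for t = 0, ..., 5 (scheme 0 would return w_5).
iterates = [[float(t)] for t in range(6)]
results = {
    "1": average_uniform(iterates),
    "0.5": average_suffix(iterates),
    "D": average_doubling(iterates),
    "W": average_polynomial(iterates, 1),
    "W^2": average_polynomial(iterates, 2),
}
```

On this increasing toy trajectory, the averages move progressively closer to the last iterate as more emphasis is put on recent iterations: 2.5 for scheme 1, about 3.33 for W, and about 3.85 for W^2.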

We plot the performance of these different averaging strategies in Figure 1, which shows the
objective function against the number of effective passes through the data (the number of
iterations divided by $n$). This figure uses a step size of $1/(\mu t)$ for all methods as we found this
gave better performance than a step size of $2/(\mu(t+1))$, although we include the performance
of W with the latter step-size for comparison. In Figure 1, we observe the following trends:

- 0: Not averaging at all is typically among the worst strategies. However, this proved 
to be the best strategy on the sido data set. This may be because the method is still 
far from the solution after 50 passes through the data. 

- 1: Uniform averaging of all iterates is always the worst strategy. 

- 0.5: Uniform averaging of the second half of the iterates is typically among the best 
strategies, provided we are in fact in the second half of the iterates. 

- D: The doubling trick typically gave among the best performance across the methods. 

- W: The proposed weighting typically performed between the doubling trick and not 
averaging. 

- W^2: Weighting the iterates by $(t + 1)^2$ always outperformed weighting them by $t + 1$.

5 Discussion 

- We note that the averaging of linear approximations of $f$ (rather than the iterates)
by $t + 1$ is also used in the optimization strategy of Nesterov [8], which achieves
an optimal $O(1/t^2)$ convergence rate for optimizing (deterministic) objectives with
Lipschitz-continuous gradients (see step 3 for their Equation 3.11).

- There are previous approaches to removing the $\log t$ term [3, 4], but the one presented
in this note is arguably somewhat simpler to implement and analyze. Rakhlin et al.
propose in [4] the '1/2-suffix averaging' scheme (and the 'doubling trick' that we used
in the experiments). Their proof technique requires separately bounding $\mathbb{E}\|w_t - w^*\|^2$
and then controlling the sum of inequalities in (2) by using that only the last half
of the iterations is averaged (the '1/2-suffix'). Hazan and Kale propose in [3] the
epoch-GD scheme, which uses a similar averaging schedule as in the 'doubling trick'
of [4], but using a fixed step-size within each geometrically sized 'epoch' of averaging,
as well as using the previous average as the initialization for an epoch.

- We note that all the schemes presented in the experiments can have their convergence
rate proven. Schemes 0 and 1 have $O((\log t)/t)$ rate whereas the schemes 0.5, D, W
and W^2 have $O(1/t)$ rate. We can show the $O(1/t)$ rate for general weighted averaging
schemes (with weight $t^k$ for iterate $t$ for some fixed $k > 1$) as well as step-sizes of the
form $\gamma_t = c/(t + b)$ for $c > 1/2$ and $b > 0$. The proof becomes longer though as the
nice telescoping sum in (3) doesn't cancel out in these cases. One has to use instead a
bound on $\mathbb{E}\|w_t - w^*\|^2$ such as in Lemma 1 in [4] to control the non-canceling terms.
The overall rate is still $O(1/t)$, but with different constants depending on $c$ and $k$.

- At the same time that we first posted this note, Shamir and Zhang independently
proposed a similar weighted average scheme in [9] which they call 'polynomial-decay
averaging'. They consider a running average scheme as in (4), but with the more
general $\rho_t = \frac{\eta + 1}{t + \eta + 1}$ where the integer $\eta \geq 0$ parameterizes the different schemes.^4
$\eta = 0$ yields the standard uniform averaging scheme, whereas $\eta = 1$ yields the simple
weighted average analyzed in Section 3.2. The general $\eta$ gives a weight of $O(t^\eta)$ for
each iterate, similar to what was mentioned in the previous paragraph, but with a
different exact formula. They provide in [9] a proof of a rate of $O(1/t)$ for $\eta > 2$.
The proof that we give in Section 3.2 can be seen as complementary and is
much simpler (as well as giving a tighter constant). We also note that the rate of
$O((\log t)/t)$ for the last iterate $w_t$ (scheme 0 above) is proven for the first time in [9].

- While this paper focuses on the non-smooth case, it is still interesting to relate results
to the smooth case (see, e.g., [10] and references therein), where in the strongly convex
case, averaging with larger step sizes (i.e., of the form $t^{-\alpha}$ with $\alpha \in (1/2, 1)$) leads to
better and more robust rates. Can larger step sizes improve results for the non-smooth
case?
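The remark above that a general running-average parameter gives polynomially growing weights "with a different exact formula" can be illustrated by unrolling recursion (4) exactly. The sketch below uses the form $\rho_t = (\eta + 1)/(t + \eta + 1)$ in this note's indexing and assumes $\bar{w}_0 = w_0$; the function names are hypothetical.

```python
from fractions import Fraction


def rho(t, eta):
    """Running-average coefficient rho_t = (eta + 1) / (t + eta + 1)."""
    return Fraction(eta + 1, t + eta + 1)


def implied_weights(T, eta):
    """Exact weight carried by each w_s (s = 0..T) in bar_w_T under recursion (4)."""
    weights = [Fraction(1)]                  # bar_w_0 = w_0
    for t in range(1, T + 1):
        r = rho(t, eta)
        # Older iterates are scaled by (1 - rho_t); the new iterate enters with rho_t.
        weights = [(1 - r) * a for a in weights] + [r]
    return weights


uniform = implied_weights(3, eta=0)     # all weights equal to 1/4
linear = implied_weights(3, eta=1)      # weights proportional to s + 1
quadratic = implied_weights(2, eta=2)   # weights in ratio 1 : 3 : 6
```

For $\eta = 2$ the weights are in ratio $1 : 3 : 6$, that is, proportional to $(s+1)(s+2)$ rather than to $(s+1)^2$ as in scheme W^2, which is the sense in which the two families grow at the same order but are not identical.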

A Finite variance bound for SVM 

We derive here the finite variance bound $\mathbb{E}\|g_t\|^2 \leq 4L_\ell^2\,\mathbb{E}\|x\|^2 = B^2$ for the general SVM-like
objective considered in Section 2 and update rule (1). To see this, we consider the
more general case of $f(w) = \mathbb{E}\,h(z, w) + \frac{\mu}{2}\|w\|^2$ where $h(z, w)$ is convex in $w$ for each $z$
(for SVM, $z = (x, y)$ and $h(z, w) = \ell(y, w^\top x)$). We make a Lipschitz-like (in expectation)
assumption on $h$ that $\mathbb{E}\|h'(z, w)\|^2 \leq L^2$, where $h'(z, w)$ denotes any subgradient with
respect to the second variable (note that $L^2 = L_\ell^2\,\mathbb{E}\|x\|^2$ for SVM, with $L_\ell$ the Lipschitz
constant of $\ell$). With $g_t = h'(z_t, w_{t-1}) + \mu w_{t-1}$, Leibniz rule yields
$\mathbb{E}(g_t \mid w_{t-1}) = f'(w_{t-1})$, as required by our setup (see (1.3) in [2]
for some regularity conditions for this to be true). Given this definition of $g_t$, we use the
Minkowski inequality on the norm function to get:

$$ \sqrt{\mathbb{E}\|g_t\|^2} \leq \sqrt{\mathbb{E}\|h'(z_t, w_{t-1})\|^2} + \mu\sqrt{\mathbb{E}\|w_{t-1}\|^2} \leq L + \mu\sqrt{\mathbb{E}\|w_{t-1}\|^2}. $$

We can then obtain the required bound of $(2L)^2$ on $\mathbb{E}\|g_t\|^2$ by showing that $\sqrt{\mathbb{E}\|w_{t-1}\|^2} \leq L/\mu$.
This can easily be proven by induction, with the assumption that $\gamma_t \leq 1/\mu$ and either
$\gamma_1 = 1/\mu$ or $\sqrt{\mathbb{E}\|w_0\|^2} \leq L/\mu$ (these assumptions are satisfied by the step sizes considered
in this note). To see this, we use the subgradient update (1) applied to this form of $f(w)$:

$$ w_t = (1 - \mu\gamma_t)\,w_{t-1} - \gamma_t\,h'(z_t, w_{t-1}). $$

Applying Minkowski inequality again, we get

$$
\begin{aligned}
\sqrt{\mathbb{E}\|w_t\|^2} &\leq (1 - \mu\gamma_t)\sqrt{\mathbb{E}\|w_{t-1}\|^2} + \gamma_t\sqrt{\mathbb{E}\|h'(z_t, w_{t-1})\|^2} \\
&\leq (1 - \mu\gamma_t)\sqrt{\mathbb{E}\|w_{t-1}\|^2} + \mu\gamma_t\,\frac{L}{\mu}.
\end{aligned}
$$

The first line above used the assumption that $\gamma_t \leq 1/\mu$ to ensure that $(1 - \mu\gamma_t)$ is
non-negative. The assumption $\gamma_1 = 1/\mu$ or $\sqrt{\mathbb{E}\|w_0\|^2} \leq L/\mu$ then yields the base case of $t = 1$.
Plugging in the induction hypothesis then yields:

$$ \sqrt{\mathbb{E}\|w_t\|^2} \leq (1 - \mu\gamma_t)\,\frac{L}{\mu} + \mu\gamma_t\,\frac{L}{\mu} = \frac{L}{\mu}, $$

which completes the proof.

^4 We note that the index $t$ is shifted by one between this note and their paper as their initial point is $w_1$
whereas ours is $w_0$. We also note that they use the misnomer 'gradient descent' for their algorithm despite
using subgradients which don't necessarily yield a descent direction.